Introduction

The following report analyzes World Development Indicators (WDI) data about the prosperity of countries around the world, with the goal of finding interesting correlations and observations. Additionally, an attempt was made to build a regressor that predicts the price of Bitcoin based on this data (as well as data about gold prices, S&P Composite values, and currency exchange rates).

The data was gathered mainly by the World Bank and contains more than 200 statistics that measure general market development across most countries in the world.

Executive summary

First, after reading and cleaning the data, each dataset was thoroughly analyzed. Missing values were filled, and additional filtering was performed to reduce the size of the datasets where necessary.

Then, a section focused on finding interesting correlations in the WDI dataset. Correlations were calculated separately for each region as well as for the entire dataset, and similar patterns were found in each. Most of the strong correlations observed were trivial (e.g. total population growing together with the urban or rural population).

Some were more interesting, like the moderate correlation between the number of under-five deaths (children dying before reaching age five) and the total population. This suggests that medicine is becoming more and more advanced as the years go by; the global under-five mortality rate has declined by 59 percent since 1990.

Another, rather alarming, finding is the very strong correlation between the total population and the amount of different gases countries produce that influence climate change and global warming.

There was also a brief overview of the effect the COVID-19 pandemic had on the global economy. On average, GDP dropped by around 5% in 2020.

The last section focused on finding a model that could predict the price of Bitcoin. Two algorithms were tried: Stochastic Gradient Boosting and Quantile Random Forest. Both performed similarly, but the first one was chosen in the end (based on the R squared, RMSE, and MAE metrics).

Analysis of variable importance showed that the most important attribute in predicting the price was the difficulty of finding a new block in the blockchain network.

The final results were not entirely satisfying (RMSE of almost 20,000 and an R squared value of around 0.36), and the conclusion was that the dataset does not capture all the factors that affect the market price of Bitcoin.

In addition, time series modeling with ARIMA was tried and compared with the regressor. It had an RMSE of around 1300, but it was used to predict only the next 20 values (versus the 813 in the regressor's test set), so the comparison is not entirely fair.

Libraries used

  • readxl
  • dplyr
  • tidyr
  • ggplot2
  • ggcorrplot
  • plotly
  • kableExtra
  • caret
  • gbm
  • quantregForest
  • zoo
  • forecast

Ensuring repeatability

set.seed(42)

Reading and cleaning the data

First, all available datasets are read into memory. To simplify the analysis, some columns are renamed and their types changed. Additionally, the blockchain data was scattered across several files, so they are joined into a single data frame.

ex_rates <- read.csv("data/CurrencyExchangeRates.csv", colClasses = c(Date = "POSIXct"))
gold_prices <- read.csv("data/Gold prices.csv", colClasses = c(Date = "POSIXct"))
sp_composite <- read.csv("data/S&P Composite.csv", colClasses = c(Year = "POSIXct")) %>% rename(Date = Year)

bchain_difficulty <- read.csv("data/Bitcoin/BCHAIN-DIFF.csv", colClasses = c(Date = "POSIXct"))
bchain_hash_rate <- read.csv("data/Bitcoin/BCHAIN-HRATE.csv", colClasses = c(Date = "POSIXct"))
bchain_market_price <- read.csv("data/Bitcoin/BCHAIN-MKPRU.csv", colClasses = c(Date = "POSIXct"))
bchain_trade_volume <- read.csv("data/Bitcoin/BCHAIN-TRVOU.csv", colClasses = c(Date = "POSIXct"))

bchain <- bchain_difficulty %>% 
  inner_join(bchain_hash_rate, by = "Date") %>%
  inner_join(bchain_market_price, by = "Date") %>%
  inner_join(bchain_trade_volume, by = "Date") %>%
  rename(Difficulty = 2, Hash_rate = 3, Market_price = 4, Trade_volume = 5)

wdi <- read_excel(
  path = "data/World_Development_Indicators.xlsx",
  sheet = 1,
  na = "..",
  n_max = 44304
) %>%
  rename_with(~ substr(.x, 1, 4), contains("YR"))

The original World Development Indicators data does not include any information about the region a country belongs to. Therefore, an additional dataset was obtained from The World Bank website to simplify the analysis.

class <- read_excel("data/CLASS.xlsx")

wdi <- wdi %>% 
  left_join(class %>% select(Economy, Region), by = c("Country Name" = "Economy"))

This dataset is still “dirty”. Each year is a separate column, so those columns are “gathered” into rows (producing a single “Year” column). Then each indicator value is “spread”, so that there is a separate column for each indicator, which decreases the overall row count.

wdi <- wdi %>% 
  gather(key = "Year", value = "Value", "1970":"2020") %>%
  mutate(Year = as.numeric(Year)) %>%
  select(-`Series Name`) %>% 
  spread(key=`Series Code`, value = Value)

Dataset overview

Exchange rates

This dataset contains information about exchange rates, presented relative to the US dollar. It has 52 attributes and 5978 observations. Only some currencies are shown for brevity.

Date Polish.Zloty Euro U.S..Dollar
Min. :1995-01-02 00:00:00 Min. :2.022 Min. :0.8252 Min. :1
1st Qu.:2000-10-05 06:00:00 1st Qu.:3.033 1st Qu.:1.0889 1st Qu.:1
Median :2006-07-06 12:00:00 Median :3.290 Median :1.2295 Median :1
Mean :2006-07-27 21:33:31 Mean :3.365 Mean :1.2076 Mean :1
3rd Qu.:2012-05-07 18:00:00 3rd Qu.:3.822 3rd Qu.:1.3338 3rd Qu.:1
Max. :2018-05-02 00:00:00 Max. :4.500 Max. :1.5990 Max. :1
NA NA’s :1765 NA’s :1070 NA

Missing values

For further analysis, it is sufficient to focus on a single currency. The Euro is chosen because of its popularity and because it is mostly free from missing values. Most of the missing values occur before the year 2000, and the analysis will probably not reach that far back. Missing values are replaced using the previous entry.

ex_rates <- ex_rates %>% 
  select(Date, Euro) %>%
  fill(Euro, .direction = "updown")

Quick peek

S&P Composite

This dataset contains information about monthly S&P Composite values starting from 1871. There are 1810 records in total.

Date S.P.Composite Dividend Earnings CPI Long.Interest.Rate Real.Price Real.Dividend Real.Earnings Cyclically.Adjusted.PE.Ratio
Min. :1871-01-31 00:00:00 Min. : 2.730 Min. : 0.1800 Min. : 0.1600 Min. : 6.28 Min. : 0.620 Min. : 73.9 Min. : 5.445 Min. : 4.576 Min. : 4.784
1st Qu.:1908-10-07 18:00:00 1st Qu.: 7.902 1st Qu.: 0.4202 1st Qu.: 0.5608 1st Qu.: 10.20 1st Qu.: 3.171 1st Qu.: 186.6 1st Qu.: 9.417 1st Qu.: 14.063 1st Qu.:11.898
Median :1946-06-15 00:00:00 Median : 17.370 Median : 0.8717 Median : 1.4625 Median : 20.35 Median : 3.815 Median : 283.3 Median :14.411 Median : 23.524 Median :16.381
Mean :1946-06-15 19:15:13 Mean : 327.968 Mean : 6.7321 Mean : 15.3714 Mean : 62.39 Mean : 4.504 Mean : 622.0 Mean :17.498 Mean : 34.907 Mean :17.215
3rd Qu.:1984-02-21 18:00:00 3rd Qu.: 164.400 3rd Qu.: 7.0525 3rd Qu.: 14.7258 3rd Qu.:102.28 3rd Qu.: 5.139 3rd Qu.: 707.0 3rd Qu.:22.301 3rd Qu.: 43.768 3rd Qu.:20.913
Max. :2021-10-31 00:00:00 Max. :4493.280 Max. :59.6800 Max. :158.7400 Max. :273.98 Max. :15.320 Max. :4477.2 Max. :63.511 Max. :159.504 Max. :44.198
NA NA NA’s :4 NA’s :4 NA NA NA NA’s :4 NA’s :4 NA’s :120

Missing values

Most columns have no missing values, four columns have just 4 each, and only one column has 120. They are replaced using the previous entry.

sp_composite <- sp_composite %>%
  fill(Dividend, Earnings, Real.Dividend, Real.Earnings, Cyclically.Adjusted.PE.Ratio, .direction = "updown")

Quick peek

Bitcoin

This dataset contains information about Bitcoin’s price changing in time as well as some blockchain technology attributes:

  • Difficulty - a relative measure of how difficult it is to find a new block. The difficulty is adjusted periodically as a function of how much hashing power has been deployed by the network of miners.
  • Hash rate - the estimated number of tera hashes per second (trillions of hashes per second) the Bitcoin network is performing.
  • Market price - average USD market price across major bitcoin exchanges.
  • Trade volume - the total USD value of trading volume on major bitcoin exchanges.

It has 4659 rows in total.

Date Difficulty Hash_rate Market_price Trade_volume
Min. :2009-01-03 00:00:00 Min. :0.000e+00 Min. : 0 Min. : 0.00 Min. :0.000e+00
1st Qu.:2012-03-12 12:00:00 1st Qu.:1.689e+06 1st Qu.: 12 1st Qu.: 7.21 1st Qu.:1.948e+05
Median :2015-05-21 00:00:00 Median :4.881e+10 Median : 356089 Median : 431.89 Median :6.824e+06
Mean :2015-05-21 00:24:32 Mean :3.665e+12 Mean : 26458258 Mean : 5132.38 Mean :1.467e+08
3rd Qu.:2018-07-28 12:00:00 3rd Qu.:5.364e+12 3rd Qu.: 38265984 3rd Qu.: 6496.35 3rd Qu.:1.484e+08
Max. :2021-10-05 00:00:00 Max. :2.505e+13 Max. :198514006 Max. :63554.44 Max. :5.352e+09

Missing values

Since inner join operations were performed shortly after reading the data, there are no missing values here.

Quick peek

Gold prices

This dataset contains information about the price of gold across multiple years, starting from 1968. There are 13585 rows.

Date USD..AM. USD..PM. GBP..AM. GBP..PM. EURO..AM. EURO..PM.
Min. :1968-01-02 00:00:00 Min. : 34.77 Min. : 34.75 Min. : 14.48 Min. : 14.48 Min. : 237.3 Min. : 236.7
1st Qu.:1981-06-10 00:00:00 1st Qu.: 280.50 1st Qu.: 281.50 1st Qu.: 177.71 1st Qu.: 178.23 1st Qu.: 335.3 1st Qu.: 335.2
Median :1994-11-14 00:00:00 Median : 383.32 Median : 383.50 Median : 234.51 Median : 234.96 Median : 892.6 Median : 896.1
Mean :1994-11-16 03:40:00 Mean : 575.20 Mean : 576.62 Mean : 370.84 Mean : 371.81 Mean : 797.3 Mean : 797.2
3rd Qu.:2008-04-23 00:00:00 3rd Qu.: 841.94 3rd Qu.: 851.50 3rd Qu.: 454.32 3rd Qu.: 456.43 3rd Qu.:1114.1 3rd Qu.:1114.9
Max. :2021-09-29 00:00:00 Max. :2061.50 Max. :2067.15 Max. :1574.37 Max. :1569.59 Max. :1743.8 Max. :1743.4
NA NA’s :1 NA’s :143 NA’s :11 NA’s :154 NA’s :7837 NA’s :7880

Missing values

For similar reasons as with the exchange rates dataset, only one currency will be kept. This time it will be the USD..AM. column (the price taken in the morning), because it has only one missing value, which will be replaced with the previous entry.

gold_prices <- gold_prices %>%
  select(Date, USD..AM.) %>%
  fill(USD..AM.)

Quick peek

World Development Indicators

World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available and includes national, regional, and global estimates.

WDI contains a large host of socio-economic indicators, from the widely used such as population and GDP to the more esoteric such as the percent of households that consume iodized salt.

This dataset contains information about indicators between the years 1970 and 2020 (10608 rows).

In the table summarizing the data below, only three indicator columns are shown for brevity.

Country Name Country Code Region Year AG.LND.PRCP.MM AG.LND.TOTL.K2 AG.LND.TOTL.UR.K2
Length:10608 Length:10608 Length:10608 Min. :1970 Min. : 18.1 Min. : 2 Min. : 0
Class :character Class :character Class :character 1st Qu.:1982 1st Qu.: 591.0 1st Qu.: 17200 1st Qu.: 566
Mode :character Mode :character Mode :character Median :1995 Median :1071.0 Median : 107160 Median : 3395
NA NA NA Mean :1995 Mean :1195.8 Mean : 2719581 Mean : 82862
NA NA NA 3rd Qu.:2008 3rd Qu.:1761.0 3rd Qu.: 547566 3rd Qu.: 15565
NA NA NA Max. :2020 Max. :3240.0 Max. :129956634 Max. :3629312
NA NA NA NA NA’s :9016 NA’s :116 NA’s :10086

Filtering indicators

Since there are many indicators (213 in total) and it would be tedious to focus on all of them, some filtering is performed: indicators with 20% or more missing values are removed entirely.

wdi <- wdi %>% 
  select(where(~mean(is.na(.)) < 0.2))

With this approach, exactly forty indicators are left. That is still a lot, but the correlations between them will be analyzed to choose which ones are interesting to examine further.

Missing values

Missing values for the remaining indicators are filled with the same strategy as for the previous datasets (using nearby entries).

wdi <- wdi %>% 
  fill(everything(), .direction = "downup")

Correlations in the WDI dataset

For the entire dataset

For each region separately

South Asia

Europe & Central Asia

Middle East & North Africa

East Asia & Pacific

Sub-Saharan Africa

Latin America & Caribbean

North America

We can observe that all graphs share a similar triangle of positive correlation and a similar area of negative correlation (although these are more vivid in North America, most likely due to the uniqueness of the USA). We can therefore conclude that the correlation structure is analogous across regions, so the following analysis will focus on the correlation matrix for the entire dataset.

Most correlations that we can notice there are trivial, e.g.:

  • population in total with the population in urban or rural areas,
  • male population with the female population,
  • land area with population,
  • methane emission with nitrous oxide emission
  • and more (of similar nature)

What may look interesting at first glance is the strong negative correlation (close to -1.0) between males (% of the total population) and females (% of the total population). After closer inspection, though, it is quite obvious: when the percentage of females increases, the percentage of males has to decrease at the same time, and vice versa. Those two values cannot grow together.
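This can be checked with a toy example (the percentages below are made up, purely for illustration): since the two shares always sum to 100%, their correlation is exactly -1.

```r
# Toy illustration (hypothetical percentages): two shares that sum
# to 100% are perfectly negatively correlated.
male_pct <- c(49.2, 49.5, 49.8, 50.1)
female_pct <- 100 - male_pct
cor(male_pct, female_pct)
## [1] -1
```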

On average, there used to be slightly more women than men between 1970 and 2009. After that year, it changed. 2009 was also the year when Bitcoin was first introduced. Coincidence?

Another curious finding is the positive, but not that strong (around 0.75), correlation between the number of under-five deaths (children dying before reaching age five) and the total population. One would expect those indicators to grow together more closely (a correlation nearer 1.0), but they do not. It might mean that as the years go by, medicine is becoming more and more advanced, so the number of under-five deaths does not grow as rapidly as the total population.

This source claims that the global under-five mortality rate declined by 59 per cent, from 93 deaths per 1,000 live births in 1990 to 38 in 2019.

Another discovery that is not very surprising (but quite alarming) is the strong correlation (close to 1.0) between the total population and the amount of different gases countries produce that influence climate change and global warming. This is to be expected: the more people there are, the more houses, factories, machines, etc. we produce.

Given that the world's population is growing at a very rapid rate, greenhouse gas production is going to keep increasing as well.

The charts above show quite a sharp decline in all gas emissions after 1990. This article in Nature claims it can be attributed to the Soviet Union's collapse, because it meant many people stopped eating meat. An estimated one-third of late-Soviet cropland was abandoned after the post-communist economic crisis.

The effect of the COVID-19 pandemic

This section presents a quick overview of the effect the COVID-19 pandemic had on the world's economy. There is widespread agreement among economists that it has had severe negative impacts on the global economy.

As we can see above, the world lost, on average, around 5 percent of gross domestic product in 2020.

To put this number in perspective, global GDP was estimated at around 84.54 trillion U.S. dollars in 2020 – meaning that a 4.5 percent drop in economic growth results in almost 2.96 trillion U.S. dollars of lost economic output.

Predicting the price of Bitcoin

This section focuses on creating a model that can predict the price of Bitcoin.

Creating a dataset

The dataset used for training the model will consist of the following attributes:

  • Bitcoin market price (target variable)
  • 3 blockchain related attributes (difficulty, hash rate and trade volume - they were described in the Bitcoin section)
  • Euro price
  • Gold price
  • S.P Composite
  • USA total population
  • USA CO2 emission
  • USA methane emission
  • USA nitrous oxide emission

These last four attributes were chosen because they were described in more detail in the previous section about correlations. The USA was selected in particular because it is one of the largest producers of those greenhouse gases.

usa_indicators <- wdi %>% 
  filter(`Country Name` == 'United States') %>% 
  select(Year, SP.POP.TOTL, EN.ATM.CO2E.KT, EN.ATM.METH.KT.CE, EN.ATM.NOXE.KT.CE) %>% 
  rename(US_Population = 2, US_CO2_emission = 3, US_Methane_emission = 4, US_Nitrous_oxide_emission = 5)

data <- bchain %>%
  mutate(Year = as.numeric(format(Date, "%Y"))) %>%
  left_join(ex_rates, by="Date") %>% 
  left_join(gold_prices, by="Date") %>% 
  left_join(sp_composite %>% select(Date, S.P.Composite), by="Date") %>%
  left_join(usa_indicators, by="Year") %>%
  select(-Year) %>%
  arrange(Date) %>%
  rename(Gold = USD..AM.)

Filling missing values

First, the year 2009 and half of 2010 are discarded, because Bitcoin's market price in that period equals 0. The first day the Bitcoin price rose above zero is 2010-08-18, so this is the starting point of the dataset.

After that, there are still some missing values. For example, the S.P Composite value is collected only once per month. Those values are filled using nearby entries.

Moreover, for some reason, the trade volume has zero values randomly placed in the dataset. For example, on 2012-11-27 the trade volume is 389726.0, on 2012-11-28 it is 0, and on 2012-11-29 it is back at 331048.97. Hence, those values too are replaced with values from nearby entries.

data <- data %>% 
  filter(Date >= "2010-08-18") %>%
  fill(everything(), .direction = "up") %>%
  fill(everything(), .direction = "down") %>%
  select(-Date)

data[data$Trade_volume == 0, "Trade_volume"] <- NA

data <- data %>%
  fill(Trade_volume, .direction = "up") %>%
  fill(Trade_volume, .direction = "down")

The summary of the final dataset is presented in the table below:

Difficulty Hash_rate Market_price Trade_volume Euro Gold S.P.Composite US_Population US_CO2_emission US_Methane_emission US_Nitrous_oxide_emission
Min. :5.120e+02 Min. : 0 Min. : 0.06 Min. :5.100e+01 Min. :1.036 Min. :1051 Min. :1087 Min. :309327143 Min. :4813720 Min. :609200 Min. :242640
1st Qu.:1.215e+07 1st Qu.: 112 1st Qu.: 104.04 1st Qu.:1.313e+06 1st Qu.:1.181 1st Qu.:1250 1st Qu.:1629 1st Qu.:316059947 1st Qu.:4950210 1st Qu.:617170 1st Qu.:246480
Median :1.635e+11 Median : 1197070 Median : 610.38 Median :1.539e+07 Median :1.201 Median :1330 Median :2094 Median :323071755 Median :4981300 Median :620810 Median :249290
Mean :4.198e+12 Mean : 30309571 Mean : 5879.46 Mean :1.681e+08 Mean :1.231 Mean :1427 Mean :2273 Mean :321503885 Mean :5006739 Mean :620646 Mean :248500
3rd Qu.:6.379e+12 3rd Qu.: 45759634 3rd Qu.: 7323.06 3rd Qu.:1.856e+08 3rd Qu.:1.312 3rd Qu.:1621 3rd Qu.:2785 3rd Qu.:326838199 3rd Qu.:5089500 3rd Qu.:622590 3rd Qu.:250060
Max. :2.505e+13 Max. :198514006 Max. :63554.44 Max. :5.352e+09 Max. :1.488 Max. :2062 Max. :4493 Max. :329484123 Max. :5392870 Max. :649940 Max. :254860

Training and tuning the model

To begin with, train and test sets have to be created. Because we are dealing with a time series, regular stratified partitioning with the createDataPartition function would not be appropriate. Therefore, manual splitting is performed (on a dataset already sorted by date, ascending). An 80/20 split is created, and the train set will be further divided into train and validation sets.

breaking_point_index <- round(nrow(data) * 0.8)

train <- data[1:breaking_point_index,]
test <- data[(breaking_point_index + 1):nrow(data),]

There are 3253 rows in the training set and 813 in the test set.

As mentioned earlier, we are dealing with a time series, so cross-validation and similar methods would not be correct: they could mix values from different dates between the training and validation sets. Luckily, the trainControl function allows us to choose the “timeslice” method, which uses rolling forecasting origin resampling, moving the training and validation sets forward in time.

The parameters passed to that function are chosen so that 8 windows are created in total, where each training window has 800 rows and each validation window 200 rows.

fitControl <- trainControl(method = "timeslice", fixedWindow = TRUE, initialWindow = 800, horizon = 200, skip = 300)
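As a sanity check, the number of slices these parameters imply can be computed by hand; a small sketch (mirroring how caret places the rolling windows on the 3253-row training set):

```r
# Rolling-origin slices: a new window starts every (skip + 1) rows, and each
# needs initialWindow training rows plus horizon validation rows.
n_rows <- 3253
initial_window <- 800
horizon <- 200
skip <- 300
window_starts <- seq(initial_window, n_rows - horizon, by = skip + 1)
length(window_starts)
## [1] 8
```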

Stochastic Gradient Boosting

First, let’s try training using Stochastic Gradient Boosting.

gbmFit <- train(Market_price ~ ., data = train, method = "gbm", trControl = fitControl, verbose = FALSE)
gbmFit
## Stochastic Gradient Boosting 
## 
## 3253 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Rolling Forecasting Origin Resampling (200 held-out with a fixed window) 
## Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      1391.364  0.3501754  1184.469
##   1                  100      1294.541  0.2695565  1085.826
##   1                  150      1274.989  0.2955726  1066.256
##   2                   50      1357.842  0.3599908  1135.415
##   2                  100      1344.389  0.3819946  1120.356
##   2                  150      1332.959  0.3838608  1108.177
##   3                   50      1352.317  0.3984152  1123.858
##   3                  100      1346.252  0.4420979  1117.584
##   3                  150      1357.742  0.4455119  1134.790
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  1, shrinkage = 0.1 and n.minobsinnode = 10.

Quantile Random Forest

Another algorithm that will be tested is Quantile Random Forest.

qrfFit <- train(Market_price ~ ., data = train, method = "qrf",  trControl = fitControl)
qrfFit
## Quantile Random Forest 
## 
## 3253 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Rolling Forecasting Origin Resampling (200 held-out with a fixed window) 
## Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    1419.542  0.2962583  1192.269
##    6    1452.531  0.3321094  1235.957
##   10    1501.793  0.3169855  1268.325
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.

Comparing the results between models

The R squared values are not visible because of the scale of the axis, so let's plot them separately.

As we can see, Stochastic Gradient Boosting performs slightly better overall, and it is chosen as the final model.
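The headline numbers behind that choice (copied from the resampling results above, for the best parameter set of each model) can be placed side by side:

```r
# Best resampling results quoted above: GBM (n.trees = 150, depth = 1)
# vs. QRF (mtry = 2). RMSE is the selection criterion used by caret.
metrics <- data.frame(
  RMSE     = c(1274.989, 1419.542),
  Rsquared = c(0.2955726, 0.2962583),
  MAE      = c(1066.256, 1192.269),
  row.names = c("GBM", "QRF")
)
rownames(metrics)[which.min(metrics$RMSE)]
## [1] "GBM"
```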

Variable importance

Not surprisingly, the most important attribute is one directly related to blockchain technology: difficulty. This matches intuition, because the higher the Bitcoin price, the more people are interested in it, the more miners there are, and the difficulty of finding a new block grows as well.

What can be surprising is that the S.P Composite attribute is more important than the two remaining blockchain attributes (trade volume and hash rate).

What is quite remarkable here is that US CO2 emissions are a better predictor of Bitcoin's price than the Euro or gold price. According to the plot, the latter play no role in predicting the price (the same goes for the total US population and US methane and nitrous oxide emissions).

Measuring performance

To measure the performance of the selected model on the test set, standard regression metrics are used:

  • RMSE (root mean squared error)
  • R squared
  • MAE (mean absolute error)
##         RMSE     Rsquared          MAE 
## 1.975857e+04 3.584703e-01 1.255453e+04

As one can see in the table above, the results are not extraordinary. Even though the R squared value on the test set is similar to the values observed during training, the RMSE is considerably higher. The reason is most likely the huge increase in Bitcoin's price in 2021, data that was not available in the training set. A potential solution would be to include more data so that the model could “see” that rapid growth.
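For reference, the three metrics can be computed directly; a minimal base-R sketch on toy vectors (caret's postResample reports the same quantities, with R squared taken as the squared correlation):

```r
# Hand-rolled regression metrics (the obs/pred values are illustrative only).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))
r2   <- function(obs, pred) cor(obs, pred)^2
obs  <- c(100, 200, 300, 400)
pred <- c(110, 190, 320, 380)
c(RMSE = rmse(obs, pred), Rsquared = r2(obs, pred), MAE = mae(obs, pred))
```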

Predicting the price of Bitcoin is hard because many more factors affect it than were captured in this report. For example, even trivial things like a positive tweet from Elon Musk regarding cryptocurrency can have a tremendous effect on the market price.

ARIMA model

We can also try some time series modeling with ARIMA.

First, a dataset is created and split into train and test sets using the same rules as in the previous section. We don't need the additional features used when creating the regressor, so no additional tables are joined; only the date and market price columns are kept. We also start on 2010-08-18, the first day the Bitcoin price rises above 0.

data <- bchain %>%
  select(Date, Market_price) %>%
  filter(Date >= "2010-08-18") %>%
  arrange(Date)

arima_train <- data[1:breaking_point_index, ]
arima_test <- data[(breaking_point_index + 1):nrow(data), ]

Now, since the Arima function only accepts univariate time series, we have to perform some transformations. The zoo package has a built-in function for this conversion; we create a time series that changes daily within the specified date range.

arima_train_ts <-
  zoo(arima_train$Market_price,
      seq(
        from = as.Date("2010-08-18"),
        to = as.Date("2019-07-14"),
        by = 1
      ))
arima_test_ts <-
  zoo(arima_test$Market_price,
      seq(
        from = as.Date("2019-07-15"),
        to = as.Date("2021-10-04"),
        by = 1
      ))

Additionally, as we saw when plotting the Bitcoin price in the Bitcoin section, the data is not stationary, which is a requirement for the ARIMA model. To deal with this, we can difference the prices to make the series stationary (by specifying the d parameter, the differencing order, in the Arima() function).
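The effect of d = 1 can be sketched on a toy series: differencing replaces the series with its consecutive changes, which removes a trend.

```r
# A toy trending (non-stationary) series and its first differences.
x <- c(10, 12, 15, 19, 24)
diff(x)
## [1] 2 3 4 5
```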

Now, we can fit the ARIMA model by specifying the three parameters (p, d, and q). Luckily, the auto.arima function tests many combinations of those parameters for us.

auto.arima(arima_train_ts, trace = T, d = 1)
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,1,2) with drift         : 43448.74
##  ARIMA(0,1,0) with drift         : 43497.45
##  ARIMA(1,1,0) with drift         : 43492.02
##  ARIMA(0,1,1) with drift         : 43491.35
##  ARIMA(0,1,0)                    : 43496.5
##  ARIMA(1,1,2) with drift         : 43481.31
##  ARIMA(2,1,1) with drift         : 43487.25
##  ARIMA(3,1,2) with drift         : 43459.53
##  ARIMA(2,1,3) with drift         : 43431.39
##  ARIMA(1,1,3) with drift         : 43462.41
##  ARIMA(3,1,3) with drift         : Inf
##  ARIMA(2,1,4) with drift         : 43411.11
##  ARIMA(1,1,4) with drift         : 43452.07
##  ARIMA(3,1,4) with drift         : 43407.52
##  ARIMA(4,1,4) with drift         : Inf
##  ARIMA(3,1,5) with drift         : 43408.85
##  ARIMA(2,1,5) with drift         : 43407.98
##  ARIMA(4,1,3) with drift         : 43411.28
##  ARIMA(4,1,5) with drift         : Inf
##  ARIMA(3,1,4)                    : 43406.41
##  ARIMA(2,1,4)                    : 43409.93
##  ARIMA(3,1,3)                    : Inf
##  ARIMA(4,1,4)                    : Inf
##  ARIMA(3,1,5)                    : 43407.77
##  ARIMA(2,1,3)                    : 43430.26
##  ARIMA(2,1,5)                    : 43406.9
##  ARIMA(4,1,3)                    : 43410.15
##  ARIMA(4,1,5)                    : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(3,1,4)                    : 43414.21
## 
##  Best model: ARIMA(3,1,4)
## Series: arima_train_ts 
## ARIMA(3,1,4) 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1     ma2     ma3     ma4
##       0.1212  -0.7051  -0.3882  -0.0694  0.7423  0.3496  0.1212
## s.e.  0.1099   0.0564   0.1090   0.1098  0.0565  0.1068  0.0192
## 
## sigma^2 estimated as 36654:  log likelihood=-21699.08
## AIC=43414.17   AICc=43414.21   BIC=43462.86

Finally, we can fit the model using the obtained parameters and evaluate its performance by predicting the next 20 values.

fitARIMA <- Arima(arima_train_ts, order = c(3, 1, 4))
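The 20-step forecast evaluated below can be produced with forecast(fitARIMA, h = 20). The same idea can be sketched with base R's stats::arima on a toy series (a smaller order than the report's ARIMA(3,1,4) is used here purely so the toy fit converges quickly):

```r
# Illustrative only: fit an ARIMA(1,1,1) to a toy random walk and
# predict the next 20 values, mirroring the evaluation below.
set.seed(42)
toy_series <- cumsum(rnorm(200))
toy_fit <- arima(toy_series, order = c(1, 1, 1))
toy_preds <- predict(toy_fit, n.ahead = 20)$pred
length(toy_preds)
## [1] 20
```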

Performance

ME RMSE MAE MPE MAPE MASE ACF1
Training set 3.184154 191.2157 57.46084 0.086365 3.716523 1.017451 -0.0013748
Test set -1188.537915 1281.3689 1188.53791 -12.043664 12.043664 21.045272 NA

Although the RMSE is smaller than the one observed for the regressor created in the previous section, the comparison is not entirely fair. The regressor predicted values across the whole test set (813 values), while the ARIMA model predicted only the next 20 days; that also means it never saw the rapid price growth of 2021.

This quick experiment showed that it can sometimes make sense to consider a standard time series modeling algorithm, which is faster to implement than a typical regression model, or at least to use it as a baseline against which other models can be compared.